The original data used in this tutorial can be found on the GitHub page of Rforwards: https://github.com/forwards/teaching_examples/tree/master/AFLW.

What you will learn

In this tutorial you will learn:

  • The difference between categorical, discrete, and continuous variables
  • How to summarise and graphically display each of them separately
  • How to summarise and plot these two types of variables together

You will learn these statistical concepts and techniques by exploring the AFL Women's (AFLW) dataset from the 2017 and 2018 seasons.

Categorical, Discrete, and Continuous variables

A variable is a set of observations. For example, imagine collecting the age of every student in your class. The list of ages can be recorded in a column of an Excel spreadsheet, and you can refer to that column as the variable Age. Each entry of the variable Age (one row, the age of one student) is referred to as an observation.

  • Categorical variables contain a finite number of categories or distinct groups. For example, the name of the football team, the gender of the player, the colour of the team. These variables are not intrinsically numeric.

  • Discrete variables are numeric variables that have a countable number of values between any two values. A discrete variable is always numeric. For example, the number of customers visiting a pharmacy in a day, the number of players in a team, the number of siblings per student in your class.

  • Continuous variables are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or date/time. For example, the heights of trees in your school, the time when you wake up in the morning.
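As a small illustration, the three kinds of variable can be sketched as R vectors. The values are copied from the first three players printed later in this tutorial; the object names club, kicks_tot and kicks_avg are chosen here for illustration only.

```r
# Categorical: team codes are labels, not numbers
club <- c("WB", "ADEL", "GWS")

# Discrete: whole-number counts (the L suffix stores them as integers)
kicks_tot <- c(9L, 35L, 21L)

# Continuous: averages can take any value within a range
kicks_avg <- c(2.3, 4.4, 3.0)

class(club)       # "character"
class(kicks_tot)  # "integer"
class(kicks_avg)  # "numeric"
```

Note that R's storage class is only a hint: a discrete count can also be stored as "numeric", so the type of a variable is ultimately decided by what it measures.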

Let’s read the AFLW spreadsheet into R and test your understanding of the different types of variables.

Note: Each function that you use in R belongs to a package, which you need to load with library() before you can use that function.
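A minimal sketch of how you might check that the packages used in this tutorial are installed before loading them. It relies on base R only; the object names needed, installed and not_installed are assumptions made for this example.

```r
# Packages that this tutorial loads with library()
needed <- c("readr", "knitr", "ggplot2", "dplyr", "plotly")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!installed]

if (length(not_installed) > 0) {
  message("Install these first with install.packages(): ",
          paste(not_installed, collapse = ", "))
}
```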

library(readr) # Load the package 'readr' in order to read .csv files into R
players <- read_csv("data/players.csv")
colnames(players) <- gsub(" ","_",colnames(players))
colnames(players)[colnames(players) %in% "Time_On_Ground_%"] <- "Time_On_Ground_prop"
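To see what the two renaming lines do, here is a small sketch on a plain character vector. The raw headers containing spaces are an assumption inferred from the cleaned names shown in this tutorial; raw_names and clean_names are illustrative names.

```r
# Assumed raw column headers, as they might appear in the .csv file
raw_names <- c("Kicks TOT", "Time On Ground %")

# gsub() replaces every space with an underscore
clean_names <- gsub(" ", "_", raw_names)
clean_names

# Rename the one column whose name still contains "%"
clean_names[clean_names == "Time_On_Ground_%"] <- "Time_On_Ground_prop"
clean_names
```

Names without spaces or symbols can be typed after `$` without backticks, which is why this clean-up is worth doing early.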

Print the first 6 rows of the players dataset.

library(knitr) # knitr provides kable(), which prints a dataset in a nicer format. Compare the two ways below.
head(players)
## # A tibble: 6 x 45
##   Player           Club  Kicks_TOT Kicks_AVG Handballs_TOT Handballs_AVG
##   <chr>            <chr>     <int>     <dbl>         <int>         <dbl>
## 1 Aasta O'Connor   WB            9       2.3            14           3.5
## 2 Abbey Holmes     ADEL         35       4.4            38           4.8
## 3 Aimee Schmidt    GWS          21       3              17           2.4
## 4 Ainslie Kemp     MELB         21       5.3             9           2.3
## 5 Akec Makur Chuot FRE          29       4.8             8           1.3
## 6 Alex Williams    GWS          47       6.7            20           2.9
## # ... with 39 more variables: Disposals_TOT <int>, Disposals_AVG <dbl>,
## #   Cont_Poss_TOT <int>, Cont_Poss_AVG <dbl>, Uncont_Poss_TOT <int>,
## #   Uncont_Poss_AVG <dbl>, `Disp_eff_%` <dbl>, Clangers_TOT <int>,
## #   Clangers_AVG <dbl>, Marks_TOT <int>, Marks_AVG <dbl>,
## #   Cont_marks_TOT <int>, Cont_marks_AVG <dbl>, Marks50_TOT <int>,
## #   Marks50_AVG <dbl>, `Hit-outs_TOT` <int>, `Hit-outs_AVG` <dbl>,
## #   Clearances_TOT <int>, Clearances_AVG <dbl>, Frees_For_TOT <int>,
## #   Frees_For_AVG <dbl>, Frees_Agst_TOT <int>, Frees_Agst_AVG <dbl>,
## #   Tackles_TOT <int>, Tackles_AVG <dbl>, `One_%s_TOT` <int>,
## #   `One_%s_AVG` <dbl>, Bounces_TOT <int>, Bounces_AVG <dbl>,
## #   Goals_TOT <int>, Goals_AVG <dbl>, Behinds_TOT <int>,
## #   Behinds_AVG <dbl>, Goal_assists_TOT <int>, Goal_assists_AVG <dbl>,
## #   `Goal_acc_%` <dbl>, Matches <int>, Time_On_Ground_prop <dbl>,
## #   Year <int>
kable(head(players))
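| Player | Club | Kicks_TOT | Kicks_AVG | Handballs_TOT | Handballs_AVG | Disposals_TOT | Disposals_AVG | Cont_Poss_TOT | Cont_Poss_AVG | Uncont_Poss_TOT | Uncont_Poss_AVG | Disp_eff_% | Clangers_TOT | Clangers_AVG | Marks_TOT | Marks_AVG | Cont_marks_TOT | Cont_marks_AVG | Marks50_TOT | Marks50_AVG | Hit-outs_TOT | Hit-outs_AVG | Clearances_TOT | Clearances_AVG | Frees_For_TOT | Frees_For_AVG | Frees_Agst_TOT | Frees_Agst_AVG | Tackles_TOT | Tackles_AVG | One_%s_TOT | One_%s_AVG | Bounces_TOT | Bounces_AVG | Goals_TOT | Goals_AVG | Behinds_TOT | Behinds_AVG | Goal_assists_TOT | Goal_assists_AVG | Goal_acc_% | Matches | Time_On_Ground_prop | Year |
|:---|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Aasta O'Connor | WB | 9 | 2.3 | 14 | 3.5 | 23 | 5.8 | 12 | 3.0 | 12 | 3.0 | 65.2 | 8 | 2.0 | 4 | 1.0 | 0 | 0.0 | 2 | 0.5 | 24 | 6 | 0 | 0.0 | 1 | 0.3 | 3 | 0.8 | 6 | 1.5 | 6 | 1.5 | 0 | 0.0 | 1 | 0.3 | 0 | 0.0 | 1 | 0.3 | 100 | 4 | 73.6 | 2017 |
| Abbey Holmes | ADEL | 35 | 4.4 | 38 | 4.8 | 73 | 9.1 | 51 | 6.4 | 27 | 3.4 | 52.1 | 17 | 2.1 | 9 | 1.1 | 4 | 0.5 | 2 | 0.3 | 0 | 0 | 5 | 0.6 | 8 | 1.0 | 2 | 0.3 | 16 | 2.0 | 5 | 0.6 | 0 | 0.0 | 2 | 0.3 | 2 | 0.3 | 2 | 0.3 | 40 | 8 | 64.5 | 2017 |
| Aimee Schmidt | GWS | 21 | 3.0 | 17 | 2.4 | 38 | 5.4 | 13 | 1.9 | 23 | 3.3 | 55.3 | 8 | 1.1 | 15 | 2.1 | 1 | 0.1 | 3 | 0.4 | 0 | 0 | 0 | 0.0 | 1 | 0.1 | 3 | 0.4 | 9 | 1.3 | 5 | 0.7 | 0 | 0.0 | 3 | 0.4 | 0 | 0.0 | 0 | 0.0 | 50 | 7 | 82.4 | 2017 |
| Ainslie Kemp | MELB | 21 | 5.3 | 9 | 2.3 | 30 | 7.5 | 18 | 4.5 | 12 | 3.0 | 50.0 | 6 | 1.5 | 8 | 2.0 | 5 | 1.3 | 3 | 0.8 | 0 | 0 | 3 | 0.8 | 1 | 0.3 | 2 | 0.5 | 8 | 2.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 2 | 0.5 | 1 | 0.3 | 0 | 4 | 63.7 | 2017 |
| Akec Makur Chuot | FRE | 29 | 4.8 | 8 | 1.3 | 37 | 6.2 | 20 | 3.3 | 16 | 2.7 | 48.6 | 8 | 1.3 | 2 | 0.3 | 1 | 0.2 | 0 | 0.0 | 6 | 1 | 5 | 0.8 | 0 | 0.0 | 2 | 0.3 | 13 | 2.2 | 11 | 1.8 | 1 | 0.2 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 6 | 64.9 | 2017 |
| Alex Williams | GWS | 47 | 6.7 | 20 | 2.9 | 67 | 9.6 | 35 | 5.0 | 23 | 3.3 | 59.7 | 10 | 1.4 | 6 | 0.9 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 4 | 0.6 | 8 | 1.1 | 1 | 0.1 | 21 | 3.0 | 14 | 2.0 | 1 | 0.1 | 0 | 0.0 | 1 | 0.1 | 1 | 0.1 | 0 | 7 | 88.6 | 2017 |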
#View(players)
  • What type of variable is Club?
players$Club[1:10]
##  [1] "WB"   "ADEL" "GWS"  "MELB" "FRE"  "GWS"  "BL"   "GWS"  "COLL" "FRE"
  • What type of variable is Kicks_TOT?
players$Kicks_TOT[1:10]
##  [1]  9 35 21 21 29 47 29  7 63  4
  • What type of variable is Kicks_AVG?
players$Kicks_AVG[1:10]
##  [1] 2.3 4.4 3.0 5.3 4.8 6.7 3.6 1.8 9.0 1.3

Summarise and display Categorical Variables: contingency tables and barplots

  • Use contingency tables to summarise categorical variables
table(players$Club)
## 
## ADEL   BL CARL COLL  FRE  GWS MELB   WB 
##   57   58   59   58   62   60   58   57
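Counts are often easier to compare as proportions. The sketch below uses the counts printed above, typed in by hand so that it runs without the data file; with the data loaded, prop.table(table(players$Club)) would give the same proportions. The names club_counts and club_pct are illustrative.

```r
# Counts copied from the output of table(players$Club)
club_counts <- c(ADEL = 57, BL = 58, CARL = 59, COLL = 58,
                 FRE = 62, GWS = 60, MELB = 58, WB = 57)

sum(club_counts)   # 469 rows in total

# prop.table() divides each count by the total; convert to percentages
club_pct <- round(100 * prop.table(club_counts), 1)
club_pct
```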
  • Use a barplot to plot categorical variables

A two-dimensional barplot usually shows the categories of the variable as labels on the x-axis and, on the y-axis, the number of times each category appears in the dataset.

Compare the following two plots:

  • geom_bar() is used to produce the barplot
  • theme_bw() is purely aesthetic and simply adds a white background
  • What does fill=Club do?
  • What does coord_flip() do?
library(ggplot2)
ggplot(data = players,aes(x=Club,fill=Club)) + geom_bar() + theme_bw()

ggplot(data = players,aes(x=Club,fill=Club)) + geom_bar() + theme_bw() + coord_flip()
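A barplot can also be drawn with base R graphics, without ggplot2. This is a sketch only, using the club counts copied by hand from the table(players$Club) output so that it runs without the data file; club_counts and bar_mids are illustrative names.

```r
# Counts copied from the output of table(players$Club)
club_counts <- c(ADEL = 57, BL = 58, CARL = 59, COLL = 58,
                 FRE = 62, GWS = 60, MELB = 58, WB = 57)

# Sort the bars from most to least frequent; las = 1 keeps axis labels horizontal.
# barplot() returns the positions of the bar midpoints along the category axis.
bar_mids <- barplot(sort(club_counts, decreasing = TRUE), las = 1,
                    ylab = "Number of rows", main = "Rows per club")
```

Sorting the categories by frequency is a design choice: it makes the largest and smallest groups easy to spot, at the cost of losing alphabetical order.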

Discrete and continuous variables: summary statistics, histograms and boxplots

Discrete and continuous variables are usually summarised and displayed using similar tools. Often, discrete variables can be seen as a special case of continuous variables.

From summary statistics to histogram

  • Which of the following summary statistics do you prefer for the total number of kicks?
  • Are you familiar with the concepts of mean, median, quantiles?
table(players$Kicks_TOT)
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17 
##  40   5   3   9  10  11   9   9  13   6  10  10   4   3   9   7   5   6 
##  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35 
##   8   6   5   9   7   6   8   5   4   5   8   9   7   4   5   6   8  10 
##  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53 
##   9   7   5  10   5  10   3   8   6   8  10   6   8   4   3   2   7   2 
##  54  55  56  57  58  59  60  61  63  64  65  66  67  68  69  71  72  73 
##   4   1   1   4   1   2   4   3   2   2   2   1   4   2   1   1   2   1 
##  74  75  76  78  79  81  82  84  85  86  87  89  91  96  97 101 102 105 
##   1   2   1   1   2   1   3   1   2   1   2   2   1   2   1   1   2   1 
## 123 124 
##   1   1
summary(players$Kicks_TOT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   28.00   30.23   44.00  124.00
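If mean, median and quantiles are new to you, a worked example on a handful of values may help. The sketch below uses the first ten values of Kicks_TOT printed earlier by players$Kicks_TOT[1:10]; the vector name kicks is illustrative.

```r
# First ten values of Kicks_TOT, copied from the earlier output
kicks <- c(9, 35, 21, 21, 29, 47, 29, 7, 63, 4)

sort(kicks)     # ordering the values makes the median easy to see

mean(kicks)     # 26.5: the sum (265) divided by the number of values (10)
median(kicks)   # 25: halfway between the two middle sorted values, 21 and 29

# First and third quartiles: roughly a quarter of the values lie below the
# first, and roughly a quarter lie above the third
quantile(kicks, probs = c(0.25, 0.75))
```

The mean is pulled upwards by the single large value 63, while the median is not, which is one reason to report both.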
  • Use histograms or boxplots to plot continuous/discrete variables
ggplot(data = players,aes(x=Kicks_TOT)) + geom_histogram(colour="white") + theme_bw()

# Alternative for continuous variables: boxplot
ggplot(data = players,aes(x="Tot kicks",y=Kicks_TOT)) + geom_boxplot() + theme_bw()

# alternative way of producing a boxplot
boxplot(players$Kicks_TOT)
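A boxplot is built from the same quartiles that summary() reports. By default, both geom_boxplot() and boxplot() let the whiskers reach at most 1.5 times the interquartile range beyond the box, and draw anything further out as individual points. The sketch below applies that rule to the quartiles copied from the summary(players$Kicks_TOT) output; the hinges computed by the plotting functions may differ slightly from these quartiles, and the variable names are illustrative.

```r
q1 <- 10   # 1st Qu. from summary(players$Kicks_TOT)
q3 <- 44   # 3rd Qu. from summary(players$Kicks_TOT)

iqr <- q3 - q1                  # interquartile range: 34
upper_fence <- q3 + 1.5 * iqr   # 95
lower_fence <- q1 - 1.5 * iqr   # -41, below the minimum of 0

c(lower = lower_fence, upper = upper_fence)
```

Under these assumptions, players with more than 95 total kicks (for example the two with 123 and 124) would appear as separate points above the upper whisker, and no points would appear below the box.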

Summarise and plot continuous variables by levels of a categorical variable

For example, summarise and plot the total number of kicks per AFLW team.

  1. Create a table containing the total number of kicks by each team in each year and save it in a new object, kicks_by_team
library(dplyr)
kicks_by_team <- players %>% group_by(Year,Club) %>%
summarise(Tot.kicks = sum(Kicks_TOT))
kicks_by_team
## # A tibble: 16 x 3
## # Groups:   Year [?]
##     Year Club  Tot.kicks
##    <int> <chr>     <int>
##  1  2017 ADEL       1052
##  2  2017 BL          977
##  3  2017 CARL        780
##  4  2017 COLL        838
##  5  2017 FRE         817
##  6  2017 GWS         758
##  7  2017 MELB        911
##  8  2017 WB          706
##  9  2018 ADEL        906
## 10  2018 BL         1077
## 11  2018 CARL        757
## 12  2018 COLL        959
## 13  2018 FRE         850
## 14  2018 GWS         848
## 15  2018 MELB        894
## 16  2018 WB         1050
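To see what group_by() and summarise() compute, the same kind of grouped sum can be sketched in base R with aggregate(). The example uses only the six rows printed by head(players), typed in by hand; first_six and totals are illustrative names.

```r
# Year, Club and Kicks_TOT for the six players shown by head(players)
first_six <- data.frame(
  Year = c(2017, 2017, 2017, 2017, 2017, 2017),
  Club = c("WB", "ADEL", "GWS", "MELB", "FRE", "GWS"),
  Kicks_TOT = c(9, 35, 21, 21, 29, 47)
)

# One row per Year/Club combination, with Kicks_TOT summed within each group
totals <- aggregate(Kicks_TOT ~ Year + Club, data = first_six, FUN = sum)
totals
# The two GWS players (21 and 47 kicks) are combined into a single row of 68
```

Six rows become five because only GWS appears twice; on the full dataset, the same grouping yields one row per club per year.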
  2. Plot the number of kicks per club.
# Use the per-club totals in kicks_by_team: plotting players directly with stat="identity"
# would overlay one bar per player instead of summing them
ggplot(data = kicks_by_team,aes(x = Club, y = Tot.kicks)) + geom_bar(position="dodge",stat="identity") + theme_bw() + facet_wrap(~Year)

# Add a title and colour by year
ggplot(data = kicks_by_team,aes(x = Club, y = Tot.kicks,fill=factor(Year))) + geom_bar(position="dodge",stat="identity") + theme_bw() + ggtitle("Total kicks by club by year (2017-2018)")

# Flip the coordinates
ggplot(data = kicks_by_team,aes(x = Club, y = Tot.kicks,fill=factor(Year))) + geom_bar(position="dodge",stat="identity") + theme_bw() + ggtitle("Total kicks by club by year (2017-2018)") + coord_flip()

# Plot total number of goals instead of kicks
goals_by_team <- players %>% group_by(Year,Club) %>% summarise(Tot.goals = sum(Goals_TOT))
ggplot(data = goals_by_team,aes(x = Club, y = Tot.goals,fill=factor(Year))) + geom_bar(position="dodge",stat="identity") + theme_bw() + ggtitle("Total goals by club by year (2017-2018)") + coord_flip()

Explore the relationship between two discrete variables: Scatterplot

# Kicks by goal
ggplot(data = players,aes(x = Kicks_TOT, y = Goals_TOT)) + geom_point() + theme_bw() + ggtitle("Total kicks by total goals (2017-2018)") + coord_flip()

ggplot(data = players,aes(x = Kicks_TOT, y = Goals_TOT)) + geom_point() + theme_bw() + ggtitle("Total kicks by total goals (2017-2018)") + coord_flip() + facet_wrap(~Year)

# Kicks by handballs
ggplot(data = players,aes(x = Kicks_TOT, y = Handballs_TOT)) + geom_point() + theme_bw() + ggtitle("Total kicks by total handballs (2017-2018)") + coord_flip()

ggplot(data = players,aes(x = Kicks_TOT, y = Handballs_TOT)) + geom_point() + theme_bw() + ggtitle("Total kicks by total handballs (2017-2018)") + coord_flip() + facet_wrap(~Year)

  • What can you say about these plots? Is there a relationship between the number of handballs per player and the number of kicks?
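One way to put a number on such a relationship is the correlation coefficient, which ranges from -1 to 1. The sketch below uses only the six players printed by head(players), typed in by hand; six rows are far too few to draw conclusions, and with the data loaded you would call cor(players$Kicks_TOT, players$Handballs_TOT) instead. The names kicks, handballs and r are illustrative.

```r
# Kicks_TOT and Handballs_TOT for the six players shown by head(players)
kicks     <- c(9, 35, 21, 21, 29, 47)
handballs <- c(14, 38, 17, 9, 8, 20)

# cor() computes the Pearson correlation by default
r <- cor(kicks, handballs)
r
```

A value near 1 would suggest that players with more kicks also tend to have more handballs; a value near 0 would suggest no linear relationship.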

  • An example of an interactive plot

library(plotly)

# label and text are extra aesthetics that ggplotly() shows in the hover tooltip;
# mapping label twice triggers a duplicate-name warning, so Club is mapped to text instead
ggplotly(ggplot(data = players,aes(x = Kicks_TOT, y = Handballs_TOT,label=Player,text=Club)) + geom_point() + theme_bw() + ggtitle("Total kicks by total handballs (2017-2018)") + coord_flip() + facet_wrap(~Year))
# Extract the R code chunks from the R Markdown source into an .R script
purl("explore_teams_and_players.Rmd")